Distinct Sampling on Streaming Data with Near-Duplicates
Authors
Abstract
In this paper we study how to perform distinct sampling in the streaming model when the data contain near-duplicates. The goal of distinct sampling is to return a distinct element uniformly at random from the universe of elements, given that all the near-duplicates are treated as the same element. We also extend the results to the sliding-window setting, in which only the most recent items are of interest. We present algorithms with provable theoretical guarantees for datasets in Euclidean space, and verify their effectiveness via an extensive set of experiments.
Related resources
Tight Bounds for Data Stream Algorithms and Communication Problems
In this thesis, we give efficient algorithms and near-tight lower bounds for the following problems in the streaming model. Improving on the works of Monemizadeh and Woodruff from SODA’10 and Andoni, Krauthgamer and Onak from FOCS’11, we give Lp-samplers requiring O(ε^{-p} log n) space for p ∈ (1, 2). Our algorithm also works for p ∈ [0, 1], taking Õ(ε^{-1} log n) space. As an application of our samp...
Streaming Quotient Filter: A Near Optimal Approximate Duplicate Detection Approach for Data Streams
The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, on-line transactions, climate warning systems, etc., has led to an unprecedented expansion in the volume of data stored world-wide. Efficient storage, management, and processing of such massive amounts of data has emerged as a central theme of rese...
Web-Scale Near-Duplicate Search: Techniques and Applications
As the bandwidth accessible to average users has increased, audiovisual material has become the fastest growing datatype on the Internet. The impressive growth of the social Web, where users can exchange user-generated content, contributes to the overwhelming number of multimedia files available. Among these huge volumes of data, a large number of near duplicates and copies exist. File copies...
An Approximate Duplicate-Elimination in RFID Data Streams Based on d-Left Time Bloom Filter
The RFID technology has been applied to a wide range of areas since it does not require contact in detecting RFID tags. However, due to the multiple readings in many cases in detecting an RFID tag and the deployment of multiple readers, RFID data contains many duplica...
Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
Data-intensive applications and computing have emerged as a central area of modern research with the explosion of data stored world-wide. Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massive amounts of data from diverse...